PERF_COUNTERS(5)                                              PERF_COUNTERS(5)

NAME
     r10k_evcntrs, r10k_event_counters, r10k_counters - Programming the
     processor event counters

DESCRIPTION
     The R10000 and R12000 processors include counters that can be used to
     count the frequency of events during the execution of a program.  The
     information returned by the counters can be helpful in optimizing the
     program.  The perfex(1) and ssrun(1) commands provide convenient
     interfaces to hardware counter information.

THE COUNTERS
     The R10000 processor supplies two performance counters for counting
     certain hardware events.  Each counter can track one event at a time,
     and there is a choice of sixteen events per counter.  There are also two
     associated control registers, which specify the event that the
     corresponding counter is counting.

     The R12000 processor supplies two performance counters for counting
     hardware events.  Each counter can track one event at a time, and you
     can choose among 32 events per counter.

     Using performance counters in a machine with both R10000 and R12000
     processors is currently undefined.

     Each counter is a 32-bit read/write register and is incremented by one
     each time the event specified in its associated control register occurs.
     The control registers also allow events to be counted only in a specific
     mode.  The modes may be user mode, one of several kernel modes, or some
     combination of kernel and user mode.

     The counters can optionally assert an interrupt upon overflow, which is
     defined as the point when the most significant bit of a counter register
     (bit 31) becomes set.  If such an overflow interrupt is enabled for that
     event in the associated control register, the interrupt is presented to
     the cpu.  Whether or not the interrupt is asserted, the counting of
     events continues after overflow.

THE CONTROL REGISTERS
     The format of the control registers is as follows:

          31              9 8      5 4    3   2   1    0
          ________________________________________________
         |        0        | Event  | IE | U | S | K | EXL |
          ________________________________________________

     Bit 4 is the interrupt enable bit, which specifies whether overflows for
     the specified event generate interrupts.

     Bits 3 through 0 specify the mode the event is counted in; they act as
     count enable bits.  Each bit enables counting when it matches the
     equivalent KSU setting in the status register of the R10000 or R12000.
     That is:

          U bit   <---->  KSU = 2, EXL = 0, ERL = 0   (user mode)
          S bit   <---->  KSU = 1, EXL = 0, ERL = 0   (supervisor mode,
                                                       not supported)
          K bit   <---->  KSU = 0, EXL = 0, ERL = 0   (kernel mode)
          EXL bit <---->  EXL = 1, ERL = 0            (transient kernel mode)

     ERL is a field in the status register of coprocessor 0.  It is set when
     the processor hits an error and is forced into kernel mode.

     For example, if the KSU bits in the status register are 2 and the ERL
     and EXL bits are both off, events enabled with the U bit are counted.
     Thus, a program that intends to use the performance counters directly
     must specify both the events that are to be counted and the modes in
     which they are to be counted.
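     As an illustration of this encoding, the following minimal sketch builds
     a raw control-register value that counts event 9 in user mode with
     overflow interrupts enabled.  The macro names are hypothetical and the
     bit positions are taken from the diagram above; programs normally use
     the /proc structures described later in this page rather than composing
     this value by hand.

          /* Hypothetical helpers mirroring the control-register layout
             shown above; not part of any system header. */
          #define CR_EXL      (1 << 0)   /* count in transient kernel mode */
          #define CR_K        (1 << 1)   /* count in kernel mode           */
          #define CR_S        (1 << 2)   /* count in supervisor mode       */
          #define CR_U        (1 << 3)   /* count in user mode             */
          #define CR_IE       (1 << 4)   /* interrupt on counter overflow  */
          #define CR_EVENT(e) ((unsigned)(e) << 5)  /* event select field  */

          /* Event 9 (primary instruction cache misses on counter 0),
             counted in user mode, with an overflow interrupt. */
          unsigned int ctrl0 = CR_EVENT(9) | CR_IE | CR_U;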
EVENTS
     The following events can be tracked by the performance counters on
     R10000 processors:

     0=cycles
          Incremented on each clock cycle.

     1=issued instructions
          Incremented each time an instruction is issued to the ALU, FPU, or
          load/store units.

     2=issued loads
          Incremented when a load, prefetch, or synchronization instruction
          is issued.

     3=issued stores
          Incremented when a store instruction is issued.

     4=issued store conditionals
          Incremented when a store-conditional instruction is issued.

     5=failed store conditionals
          Incremented when a store-conditional instruction fails.  A failed
          store-conditional instruction will, in the normal course of events,
          graduate; so this event represents a subset of the store-
          conditional instructions counted on event 20 (graduated store
          conditionals).

     6=decoded branches
          Incremented when a branch is decoded (for revision 2.x processors)
          or resolved (for revision 3.x processors).

     7=quadwords written back from secondary cache
          Incremented when data is written back from secondary cache to the
          system interface.

     8=correctable secondary cache data array ECC errors
          Incremented when a single-bit ECC error is detected on data read
          from secondary cache.

     9=primary instruction cache misses
          Incremented when the next instruction is not in the primary
          instruction cache.

     10=secondary instruction cache misses
          Incremented when the next instruction is not in the secondary
          instruction cache.

     11=instruction misprediction from secondary cache way prediction table
          Incremented when the secondary cache way prediction table
          mispredicts the way for an instruction fetch.

     12=external interventions
          Incremented when an external intervention is entered into the Miss
          Handling Table (MHT), provided that the intervention is not an
          invalidate type.

     13=external invalidations
          Incremented when an intervention is entered into the Miss Handling
          Table, provided that the intervention is an invalidate type.

     14=virtual coherency conditions or ALU/FPU completion cycles
          Incremented on virtual coherency conditions (on revision 2.x R10000
          processors) or on ALU/FPU functional unit completion cycles (on
          revision 3.x R10000 processors).

     15=graduated instructions
          Incremented when an instruction graduates.

     16=cycles
          Incremented on each clock cycle.

     17=graduated instructions
          Incremented when an instruction graduates.

     18=graduated loads
          Incremented on a graduated load, prefetch, or synchronization
          instruction.

     19=graduated stores
          Incremented on a graduated store instruction.

     20=graduated store conditionals
          Incremented when a store-conditional instruction graduates.

     21=graduated floating-point instructions
          Incremented when a floating-point instruction graduates.

     22=quadwords written back from primary data cache
          Incremented when data is written back from primary data cache to
          secondary cache.

     23=TLB misses
          Incremented when a translation lookaside buffer (TLB) refill
          exception occurs.

     24=mispredicted branches
          Incremented when a branch is mispredicted.

     25=primary (L1) data cache misses
          Incremented when the next data item is not in the primary data
          cache.

     26=secondary (L2) data cache misses
          Incremented when the next data item is not in the secondary data
          cache.
     27=data misprediction from secondary cache way prediction table
          Incremented when the secondary cache way prediction table
          mispredicts the way for a data item.

     28=external intervention hits in secondary cache (L2)
          Set as follows when an external intervention is determined to have
          hit in the secondary cache:

               00   Invalid, no hit detected
               01   Clean, shared
               10   Clean, exclusive
               11   Dirty, exclusive

     29=external invalidation hits in secondary cache (L2)
          Set when an external invalidate request is determined to have hit
          in the secondary cache.  Its value is equivalent to that described
          for event 28.

     30=store/fetch exclusive to clean block in secondary cache (L2)
          Incremented on each cycle by the number of entries in the Miss
          Handling Table (MHT) waiting for a memory operation to complete.

     31=store/fetch exclusive to shared block in secondary cache (L2)
          Incremented when an update request is issued for a line in the
          secondary cache.  If the line is in the clean state, the counter is
          incremented by one.  If the line is in the shared state, the
          counter is incremented by two.  The conditional counting mechanism
          can be used to select whether one, both, or neither of these events
          is chosen.

     Note that the definitions of events 6 and 14 on counter 0 differ
     depending on the R10000 chip revision.  The chip revision can be
     determined with the hinv(1) command.

     The following events can be tracked by the performance counters on
     R12000 processors:

     0=cycles
          Incremented on each clock cycle.

     1=decoded instructions
          Incremented by the total number of instructions decoded on the
          previous cycle.  Since decoded instructions may later be killed
          (for a variety of reasons), this count reflects the overhead due to
          incorrectly speculated branches and exception processing.

     2=decoded loads
          Incremented when a load instruction was decoded on the previous
          cycle.  Prefetch, cache operation, and synchronization instructions
          are not included in the count of decoded loads.

     3=decoded stores
          Incremented when a store instruction was decoded on the previous
          cycle.  Store conditionals are included in this count.

     4=mishandling table occupancy
          Incremented on each cycle by the number of currently valid entries
          in the Miss Handling Table (MHT).  The MHT has five entries.  Four
          entries are used for internally generated accesses; the fifth entry
          is reserved for externally generated events.  All five entries are
          included in this count.  See event 30 for a related definition.

     5=failed store conditionals
          Incremented when a store-conditional instruction fails.  A failed
          store-conditional instruction will, in the normal course of events,
          graduate; so this event represents a subset of the store-
          conditional instructions counted on event 20 (graduated store
          conditionals).

     6=resolved conditional branches
          Incremented both when a branch is determined to have been
          mispredicted and when a branch is determined to have been correctly
          predicted.  When this determination of the accuracy of a branch
          prediction is known, the branch is said to be "resolved."  This
          counter correctly reflects the case of multiple floating-point
          conditional branches being resolved in a single cycle.

     7=quadwords written back from secondary cache
          Incremented on each cycle that the data for a quadword is written
          back from secondary cache to the system interface unit.
     8=correctable secondary cache data array ECC errors
          Incremented on the cycle following the correction of a single-bit
          error in a quadword read from the secondary cache data array.

     9=primary instruction cache misses
          Incremented one cycle after an instruction fetch request is entered
          into the Miss Handling Table.

     10=secondary instruction cache misses
          Incremented the cycle after a refill request is sent to the system
          interface module of the CPU.  This is normally just after the L2
          tags are checked and a miss is detected, but it may be delayed if
          the system interface module is busy with another request.

     11=instruction misprediction from secondary cache way prediction table
          Incremented when the secondary cache control begins to retry an
          access because it hit in the unpredicted way, provided that the
          request that initiated the access was an instruction fetch.

     12=external interventions
          Incremented on the cycle after an intervention is entered into the
          Miss Handling Table, provided that the intervention is not an
          invalidate type.

     13=external invalidations
          Incremented on the cycle after an intervention is entered into the
          Miss Handling Table, provided that the intervention is an
          invalidate type.

     14=ALU/FPU progress cycles
          Incremented on the cycle after either ALU1, ALU2, FPU1, or FPU2
          marks an instruction as done.

     15=graduated instructions
          Incremented by the number of instructions that were graduated on
          the previous cycle.  Integer multiply and divide instructions each
          count as two graduated instructions because they occupy two entries
          in the active list.

     16=executed prefetch instructions
          Incremented on the cycle after a prefetch instruction does its tag
          check, regardless of whether a data cache line refill is initiated.

     17=prefetch primary data cache misses
          Incremented on the cycle after a prefetch instruction does its tag
          check and a refill of the corresponding data cache line is
          initiated.

     18=graduated loads
          Incremented by the number of loads that graduated on the previous
          cycle.  Prefetch instructions are included in this count.  Up to
          four loads can graduate in one cycle.

     19=graduated stores
          Incremented on the cycle after a store graduates.  Only one store
          can graduate per cycle.  Store conditionals are included in this
          count.

     20=graduated store conditionals
          Incremented on the cycle following the graduation of a store-
          conditional instruction.  Both failed and successful store-
          conditional instructions are included in this count, so successful
          store conditionals can be determined as the difference between this
          event and event 5 (failed store conditionals).

     21=graduated floating-point instructions
          Incremented by the number of floating-point instructions that
          graduated on the previous cycle.  There can be 0 to 4 such
          instructions.

     22=quadwords written back from primary data cache
          Incremented on each cycle that a quadword of data is valid and is
          written from primary data cache to secondary cache.

     23=TLB misses
          Incremented on the cycle after the translation lookaside buffer
          (TLB) miss handler is invoked.

     24=mispredicted branches
          Incremented on the cycle after a branch is restored because it was
          mispredicted.
     25=primary data cache misses
          Incremented one cycle after a request is entered into the SCTP
          logic, provided that the request was initially targeted at the
          primary data cache.  Such requests fall into three categories:

          1)   Primary data cache misses.

          2)   Requests to change the state of secondary and primary data
               cache lines from clean to dirty ("update" requests) due to
               stores that hit a clean line in the primary data cache.

          3)   Requests initiated by cache operation instructions.

     26=secondary data cache misses
          Incremented the cycle after a refill request is sent to the system
          interface module of the CPU.  This is normally just after the L2
          tags are checked and a miss is detected, but it can be delayed if
          the system interface module is busy with another request.

     27=data misprediction from secondary cache way prediction table
          Incremented when the secondary cache control begins to retry an
          access because it hit in the unpredicted way.  The counter is
          incremented only if the request that initiated the access was not
          an instruction fetch.

     28=state of external intervention hits in secondary cache
          Set on the cycle after an external intervention is determined to
          have hit in the secondary cache.  The value of the event is equal
          to the state of the secondary cache line that was hit.  Setting a
          performance control register to select this event has a special
          effect on the conditional counting behavior.  If event 28 or 29 is
          selected, the sense of the "Negated conditional counting" bit is
          inverted.  See the description of conditional counting for details.
          The values are:

               00   Invalid, no hit detected
               01   Clean, shared
               10   Clean, exclusive
               11   Dirty, exclusive

     29=state of invalidation hits in secondary cache (L2)
          Set on the cycle after an external invalidate request is determined
          to have hit in the secondary cache.  Its value is equivalent to
          that described for event 28.

     30=Miss Handling Table entries accessing memory
          Incremented on each cycle by the number of entries in the Miss
          Handling Table (MHT) waiting for a memory operation to complete.
          It is always less than or equal to the value tracked by counter 4.
          An entry is considered to begin accessing memory when the cache
          control logic recognizes that a request must be sent via the SysAD
          bus.  An entry is included in this count from that point until the
          entry is removed from the MHT.  For example, once the secondary
          cache tags are checked and a secondary cache miss is recognized,
          the entry that originated the request is included in this count.
          It continues to be included until the last word of the refilled
          line is written into the secondary cache and the MHT entry is
          removed.  Unlike counter 4, the fifth slot of the MHT, which is
          reserved for externally generated requests, is not included in this
          count.

     31=store/prefetch exclusive to shared block in secondary cache (L2)
          Incremented on the cycle after an update request is issued for a
          line in the secondary cache.  If the line is in the clean state,
          the counter is incremented by one.  If the line is in the shared
          state, the counter is incremented by two.  The conditional counting
          mechanism can be used to select whether one, both, or neither of
          these events is chosen.

     The kernel maintains 64-bit virtual counters for the user program using
     the hardware counters.
     The view of the counters as 64-bit counters is maintained through the
     programming interfaces that use them, even though the actual counters
     are only 32 bits wide.  Similarly, there are only two hardware counters
     per CPU, but the programming interface supports the view that there are
     actually 32 counters.  That is, a user program can specify that more
     than one event per hardware counter is to be counted, up to sixteen
     events per counter.  The kernel then multiplexes the events across clock
     tick boundaries.  So, if a program is tracking more than one event per
     counter, on every clock tick the kernel checks whether it is necessary
     to switch the events being tracked.  If necessary, it saves the counts
     for the previous events and sets up the counters for the next event.
     Thus, to the program there are 32 64-bit counters available.

     The performance counters are available to the user program primarily
     through the perfex(1) and ssrun(1) commands.  You can also access the
     counters through the /proc(4) interface.  A limited and more specialized
     functionality is also provided through the syssgi(2) interface, but this
     is not intended to be the general interface.

     Using perfex, you can select the events to be counted on the hardware
     counters and the executable program to be run.  The perfex command
     prints the values of the hardware counters following the run.  See the
     perfex(1) man page for more information.

     The ssrun command is part of the SpeedShop performance analysis package,
     and it provides input to the WorkShop cvperf(1) user interface or, in
     ASCII format, to the prof(1) command.  See the various man pages, the
     SpeedShop User's Guide, and the Developer Magic: Performance Analyzer
     User's Guide for more information.

     Through /proc, ioctls allow you to start or stop using the counters, to
     read the counts in your own counters, or to modify the way the counters
     are being used.  Since this interface takes a process ID as a parameter,
     it is possible, in general, for a process to read or manipulate the
     counters of another process, as long as the calling process belongs to
     the same process group or is root.

     There are also ioctls that allow the program to specify overflow
     thresholds on a per-event basis and to supply a signal to be sent to the
     program upon overflow.  That is, the fact that an interrupt can be
     generated whenever a particular counter overflows can be exploited to
     allow a program to specify a threshold n for an event such that after n
     occurrences of the event an interrupt is generated.  In addition, while
     the kernel is servicing the counter overflow interrupt, it can perform
     some user-specified action, such as sending a user-specified signal to
     the program or incrementing a PC bucket for profiling.  The latter
     choice is a more specialized functionality and is not part of the
     general /proc interface.
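     For example, using the hwperf_profevctrarg_t structure described below
     under "/proc Commands for the Performance Counters", a program could ask
     to be sent SIGUSR1 after every 10000 primary instruction cache misses.
     The following minimal sketch shows only the relevant fields; the
     threshold value is an arbitrary illustration, and evctr_args is assumed
     to be an otherwise zeroed hwperf_profevctrarg_t:

          /* Sketch: event 9 (primary instruction cache misses), counted in
             user mode, with SIGUSR1 delivered after every 10000 events. */
          evctr_args.hwp_evctrargs.hwp_evctrl[9].hwperf_creg.hwp_ev = 9;
          evctr_args.hwp_evctrargs.hwp_evctrl[9].hwperf_creg.hwp_mode =
              HWPERF_CNTEN_U;
          evctr_args.hwp_evctrargs.hwp_evctrl[9].hwperf_creg.hwp_ie = 1;
          evctr_args.hwp_ovflw_freq[9] = 10000;  /* overflow threshold     */
          evctr_args.hwp_ovflw_sig = SIGUSR1;    /* signal sent on overflow */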
     For a process using the counters in user mode, the control block for the
     counters is kept in the u-area.  Thus, when the process forks, the child
     acquires the same counter state as the parent, which implies that the
     next time the child runs the performance counters will run for the
     child, tracking the same events as its parent.  The counter values are
     therefore zeroed for the child upon fork, so that at a later time the
     child's counters accurately depict the activity of the child.  Because
     of this, it is possible for the parent to fork and then wait for the
     child to exit.  When the child exits, if the kernel sees that the parent
     is waiting for the child, it adds the child's 64-bit counters to those
     of the parent, and the parent thus acquires the event trace of the
     child.  Other methods for a parent to acquire a child's counters are
     discussed with the PIOCSAVECCNTRS ioctl.

  Operation Modes for the Performance Counters
     There are two basic modes in which the counters are used: user mode and
     system mode.  Using them in user mode allows the counters to be shared
     among any number of user programs.  In this mode the kernel saves and
     restores the counts and state of the counters across context switch
     boundaries.

     System mode is in effect when a user with root privileges uses the
     counters in kernel mode (user mode and/or EXL mode may also be
     specified, but kernel mode is essential).  In this mode there are no
     context switch boundaries, so other programs cannot use the counters
     while they are in use in system mode.  Therefore, when the counters are
     already in use in user mode, a program that attempts to use them in
     system mode will fail with EBUSY, since the two modes cannot coexist
     (unless certain commands are employed to force the release of the
     counters in user mode and the acquisition of them in system mode, as
     discussed later).  Likewise, if the counters are in use in system mode,
     any program attempting to use the counters will fail with EBUSY
     (root-level or otherwise).

     The approach taken to these two operating modes is that system mode has
     the higher priority.  For this reason there is a syssgi command to
     forcibly acquire the counters in system mode.  Any current users of the
     counters on any cpu are forced to release them, and any users of the
     counters who are not currently running will not be able to acquire them
     when they run again.

     This latter situation holds at all times.  That is, there may be several
     programs sharing the counters in user mode.  If at any moment they all
     happen to be switched out, the counters are temporarily free.  At this
     point it is possible for a super-user to acquire the counters in system
     mode.  Then, when the other programs run again, they will not be able to
     acquire the counters, since the counters are in use in system mode.
     Because such a program would then run without the intended event
     counting, the kernel arranges that the program will not use the counters
     again unless they are explicitly restarted; the values in the counters
     are no longer representative of the program.

     To reiterate, a root-level program may receive EBUSY from the kernel if
     it tries to acquire the counters in system mode through /proc and they
     are actively in use at the time of the system call.
     If the counters are in use in user mode by other programs, but those
     programs are not running at the time of the system call, then the
     counters will be successfully acquired in system mode and the other
     programs will not be able to acquire them again; the kernel will not try
     to start up the counters for those programs again.  To make this
     situation visible to the program, a generation number is used to reflect
     the current state of the counters.  Whenever the kernel turns off the
     use of the counters for a program because the mode of operation has
     switched from user mode to system mode, the generation number for the
     user programs' counters is increased.  Thus, subsequent reads of the
     counters return the new number, which signals the program that the
     counter values are not to be trusted.  The generation number is
     discussed in greater detail later.

     To support using the counters in system mode, each cpu has its own
     control block for the counters, pointed to from its private area.  There
     is also a global counter control block that maintains counter state for
     the entire system.  When the counters are being used in system mode,
     they are not read and stored across context switch boundaries.  In fact,
     unless they are explicitly read by a program, the counters are not read
     by the kernel until there is an overflow interrupt.  When this occurs,
     the cpu on which the interrupt occurs updates its own private virtual
     counters; no changes are made to the global counter control block.  When
     the counters are read in system mode via PIOCGETEVCTRS through /proc,
     the per-cpu counters are all added together into the global counters, so
     that the global counters represent the sum total of the counted events
     for the entire system.  This same coalescing of the per-cpu counters
     happens when the counters are released.  Note that it is also possible
     to read a particular cpu's counters via the syssgi HWPERF_GET_CPUCNTRS
     command.

  /proc Commands for the Performance Counters
     To support the /proc interface for the counters, several data structures
     defined in /usr/include/sys/hwperftypes.h are used either to pass
     parameters with the calls or to receive data back from the kernel.

          struct hwperf_ctrlreg {
                  ushort_t        hwp_ev :11,     /* event counted */
                                  hwp_ie :1,      /* overflow intr enable */
                                  hwp_mode:4;     /* user/kernel/EXL */
          };

          typedef union {
                  short                   hwperf_spec;
                  struct hwperf_ctrlreg   hwperf_creg;
          } hwperf_ctrl_t;

          typedef struct {
                  hwperf_ctrl_t   hwp_evctrl[HWPERF_EVENTMAX];
          } hwperf_eventctrl_t;

     Each event is described to the kernel through an hwperf_ctrl_t.  Where
     relevant, the ioctls take the address of an hwperf_eventctrl_t, the
     array of 32 hwperf_ctrl_t's.  If the user is not interested in an event,
     care must be taken to ensure that the corresponding element in this
     array is zero.

     To gain access to the counters, a user program must indicate which
     events are of interest and how they are to be counted; whether overflow
     thresholds are to be used to generate overflow interrupts, and what
     those thresholds are per event; and what signal the program would like
     to receive from the kernel upon overflow interrupt.
     All of this information is conveyed with the structure
     hwperf_profevctrarg_t:

          typedef struct hwperf_profevctrarg {
                  hwperf_eventctrl_t      hwp_evctrargs;
                  int                     hwp_ovflw_freq[HWPERF_EVENTMAX];
                  int                     hwp_ovflw_sig;  /* SIGUSR1,2 */
          } hwperf_profevctrarg_t;

     When this structure is used as a parameter, the user program must take
     care to zero the hwp_ovflw_freq elements for which no overflow
     thresholds are intended.  The hwp_ovflw_sig field tells the kernel which
     signal the program wants to receive upon overflow interrupt.  The
     acceptable signals are between 1 and 32 (SIG32).  This field should be
     zero if no signal is wanted.

     The following structure is an array of 32 64-bit virtual counters and is
     used when a program wants to read the virtual counters of a process:

          typedef struct {
                  __uint64_t      hwp_evctr[HWPERF_EVENTMAX];
          } hwperf_cntr_t;

     It is also possible to read the counters and all of the prusage
     information of a process in one call.  To this end, hwperf_prusage_t is
     defined:

          typedef struct hwperf_prusage {
                  timespec_t   pu_tstamp;      /* time stamp */
                  timespec_t   pu_starttime;   /* time process was started */
                  timespec_t   pu_utime;       /* user CPU time */
                  timespec_t   pu_stime;       /* system CPU time */
                  __uint64_t   pu_minf;        /* minor (mapping) page faults */
                  __uint64_t   pu_majf;        /* major (disk) page faults */
                  __uint64_t   pu_utlb;        /* user TLB misses */
                  __uint64_t   pu_nswap;       /* swaps (process only) */
                  __uint64_t   pu_gbread;      /* gigabytes ... */
                  __uint64_t   pu_bread;       /* and bytes read */
                  __uint64_t   pu_gbwrit;      /* gigabytes ... */
                  __uint64_t   pu_bwrit;       /* and bytes written */
                  __uint64_t   pu_sigs;        /* signals received */
                  __uint64_t   pu_vctx;        /* voluntary context switches */
                  __uint64_t   pu_ictx;        /* involuntary context switches */
                  __uint64_t   pu_sysc;        /* system calls */
                  __uint64_t   pu_syscr;       /* read() system calls */
                  __uint64_t   pu_syscw;       /* write() system calls */
                  __uint64_t   pu_syscps;      /* poll() or select() system calls */
                  __uint64_t   pu_sysci;       /* ioctl() system calls */
                  __uint64_t   pu_graphfifo;   /* graphics pipeline stalls */
                  __uint64_t   pu_graph_req[8];  /* graphics resource requests */
                  __uint64_t   pu_graph_wait[8]; /* graphics resource waits */
                  __uint64_t   pu_size;        /* size of swappable image in pages */
                  __uint64_t   pu_rss;         /* resident set size */
                  __uint64_t   pu_inblock;     /* block input operations */
                  __uint64_t   pu_oublock;     /* block output operations */
                  __uint64_t   pu_vfault;      /* total number of vfaults */
                  __uint64_t   pu_ktlb;        /* kernel TLB misses */
                  cpu_mon_t    pu_cpu_mon;     /* cpu monitoring stats */
          } hwperf_prusage_t;

     The ioctls available through /proc are the following:

     PIOCENEVCTRS
          Start using the counters for a process, either in user mode or
          system mode.  This ioctl initializes the counters for the target
          process and, if the process is running, starts them.  Otherwise,
          the counters will be started the next time the process runs.  Fails
          with EINVAL if events are specified improperly, or if an input
          overflow frequency (threshold) is negative.  If supervisor or
          kernel mode is specified for any of the events and the caller does
          not have root privileges, it fails with EPERM.
          EBUSY may be returned for two possible reasons: (1) the counters
          are already in use in system mode, or (2) the caller is requesting
          the counters in system mode and, at the time of the request, the
          counters are in use in user mode on at least one cpu (this command
          will not forcibly acquire the counters for a root process).

          Returns a positive generation number if successful.

     PIOCGETEVCTRS
          Read the virtual counters of the target process.  The address of
          an hwperf_cntr_t must be supplied with the call.  Returns a
          positive generation number if successful.

     PIOCGETPREVCTRS
          Read a process's counters in addition to reading all the prusage
          information associated with the process.  The address of an
          hwperf_prusage_t must be supplied with the call.  Returns a
          positive generation number if successful.

     PIOCGETEVCTRL
          Retrieve the control information for the process's counters: which
          events are being counted and the mode they are being counted in.
          The kernel copies out an array of 32 event specifiers, so the user
          must supply the address of an hwperf_eventctrl_t.  Returns a
          positive generation number if successful.

     PIOCSETEVCTRL
          Modify how a program is using the counters, whether it be the
          events and/or their associated mode of operation, the overflow
          threshold values, or the overflow signal.  Once the counters have
          been acquired, this is how their operation for a program is
          modified without releasing the counters.  Each time PIOCSETEVCTRL
          is issued, the generation number for the target process's counters
          is incremented.  The parameter to this call is the address of an
          hwperf_profevctrarg_t.  Returns a positive generation number if
          successful.

     PIOCRELEVCTRS
          Release the performance counters; the target process will not have
          any events counted after this call.  Note that the virtual counters
          associated with the target may still be read as long as the process
          has not exited.  No parameters are necessary.

     PIOCSAVECCNTRS
          Allow a parent process to receive the counter values of one of its
          children when the child exits, without having to wait for the child
          (when the parent is waiting, no explicit call is necessary).  When
          the child exits, its counter values are added to the parent's,
          whether the parent is using its counters or not.  No parameters are
          necessary other than the target pid.
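     Since the target process is identified by the /proc file that is opened
     (as in the EXAMPLE section below), a process with appropriate permission
     can also read the virtual counters of another process that is already
     using them.  The following is a minimal sketch only; the pid shown is
     hypothetical, and the header names are the usual locations and are given
     as assumptions.

          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
          #include <sys/types.h>
          #include <sys/procfs.h>
          #include <sys/hwperftypes.h>
          #include <sys/hwperfmacros.h>

          /* Read the virtual counters of another process via /proc.
             Returns the generation number, or -1 on error. */
          int
          read_counters(pid_t target, hwperf_cntr_t *cnts)
          {
                  char    pfile[32];
                  int     fd, gen;

                  sprintf(pfile, "/proc/%05d", (int)target);
                  if ((fd = open(pfile, O_RDWR)) < 0)
                          return -1;
                  gen = ioctl(fd, PIOCGETEVCTRS, (void *)cnts);
                  close(fd);
                  return gen;
          }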
EXAMPLE
     An example of how these commands would be used is given here.  Suppose
     that we want to count instruction cache misses and data cache misses for
     our own program.  That means we want to count event 9 on both counters:
     event 9 (primary instruction cache misses) on counter 0, and event 9 on
     counter 1, which is event 25 (primary data cache misses) in the
     numbering above.  Both events are counted in user mode.  The following
     code accomplishes this.  Note that the constants used are defined in
     /usr/include/sys/hwperfmacros.h, and evctr_args is an
     hwperf_profevctrarg_t.

          char                    pfile[32];
          int                     fd, i, pid;
          int                     generation1, generation2;
          hwperf_profevctrarg_t   evctr_args;
          hwperf_cntr_t           cnts;

          pid = getpid();
          sprintf(pfile, "/proc/%05d", pid);
          fd = open(pfile, O_RDWR);

          /* counter 0: count event 9 in user mode, zero all other entries */
          for (i = 0; i < HWPERF_CNTEVENTMAX; i++) {
              if (i == 9) {
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_mode =
                      HWPERF_CNTEN_U;
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_ie = 1;
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_ev = i;
                  evctr_args.hwp_ovflw_freq[i] = 0;
              } else {
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_spec = 0;
                  evctr_args.hwp_ovflw_freq[i] = 0;
              }
          }
          /* counter 1: count event 9 (global event 25), zero other entries */
          for (i = HWPERF_CNT1BASE; i < HWPERF_EVENTMAX; i++) {
              if (i == HWPERF_CNT1BASE + 9) {
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_mode =
                      HWPERF_CNTEN_U;
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_ie = 1;
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_creg.hwp_ev =
                      i - HWPERF_CNT1BASE;
                  evctr_args.hwp_ovflw_freq[i] = 0;
              } else {
                  evctr_args.hwp_evctrargs.hwp_evctrl[i].hwperf_spec = 0;
                  evctr_args.hwp_ovflw_freq[i] = 0;
              }
          }
          evctr_args.hwp_ovflw_sig = 0;

          generation1 = ioctl(fd, PIOCENEVCTRS, (void *)&evctr_args);
          if (generation1 < 0) {
              perror("failed to acquire counters");
              exit(errno);
          }

          . . . . .  (body of program)  . . . .

          /* now read the counter values */
          if ((generation2 = ioctl(fd, PIOCGETEVCTRS, (void *)&cnts)) < 0) {
              perror("PIOCGETEVCTRS returns error");
              exit(errno);
          }

          /* generation number should be the same */
          if (generation1 != generation2) {
              printf("program lost event counters\n");
              exit(0);
          }

          /* release the counters */
          if ((ioctl(fd, PIOCRELEVCTRS)) < 0) {
              perror("PIOCRELEVCTRS returns error");
              exit(errno);
          }

          /* print out the counts */
          printf("instruction cache misses: %lld\n", cnts.hwp_evctr[9]);
          printf("data cache misses: %lld\n", cnts.hwp_evctr[25]);

          exit(0);

  Syssgi Commands for the Performance Counters
     The syssgi commands that access the event counters are not intended for
     general use.  Rather, specialized commands are implemented through this
     interface.  Note that each command is passed as the argument that
     follows the syssgi request SGI_EVENTCTR.  The available commands are:

     HWPERF_PROFENABLE
          Enable sprofil-like profiling using the performance counters rather
          than the clock.  Returns EINVAL on incorrect input, or EBUSY if the
          counters are already in use in system mode.  The second argument to
          this command is the address of an hwperf_profevctrarg_t, the third
          argument is the profp, and the fourth is the profcnt, both
          referring to input necessary for profiling.  Returns a positive
          generation number if successful.

     HWPERF_ENSYSCNTRS
          Forcibly acquire the counters in system mode.  ROOT PERMISSIONS ARE
          REQUIRED FOR THIS COMMAND.  Note that the counters must be set up
          in kernel mode (user and EXL modes may be included, but kernel mode
          is required); EINVAL is returned otherwise.  That is, at least one
          of the events must be counted in kernel mode.  Fails with EBUSY if
          the counters are already in use in system mode.  Otherwise, the
          command is guaranteed to return the counters in system mode.
          Starts up the counters on all the cpus, with all the cpus counting
          the same events.
          Takes as input (the third parameter of the syssgi call) the address
          of an hwperf_profevctrarg_t, which is set up just as it is for
          PIOCENEVCTRS (see the example above).  Returns a positive
          generation number if successful.

     HWPERF_GET_SYSCNTRS
          Read the global system counters to get the global event counts.
          All of the per-cpu counters are aggregated into the global counters
          and the results are returned to the caller.  The caller must supply
          in the third argument the address of an hwperf_cntr_t.  Returns a
          positive generation number if successful.

     HWPERF_GET_CPUCNTRS
          Read a particular cpu's event counters.  The third parameter is a
          cpuid; the fourth is the address of an hwperf_cntr_t.  Returns a
          positive generation number if successful, 0 otherwise (which
          indicates an invalid cpuid).

     HWPERF_GET_SYSEVCTRL
          Retrieve the control information for the system's event counters:
          which events are being counted and the modes they are being counted
          in.  The third parameter must be the address of an
          hwperf_eventctrl_t.  Returns EINVAL if the counters are not in use.
          Returns a positive generation number if successful.

     HWPERF_SET_SYSEVCTRL
          Modify how the system counters are operating, whether it be the
          events being counted and/or their associated mode of operation, the
          overflow threshold values, or the overflow signal.  THE CALLER MUST
          BE ROOT TO ISSUE THIS COMMAND, or EPERM is returned.  Once the
          counters have been acquired, this is how their operation is
          modified without releasing them.  Each time the system call
          syssgi(SGI_EVENTCTR, HWPERF_SET_SYSEVCTRL, ...) is issued, the
          generation number for the system's counters is incremented.  The
          third parameter to this call is the address of an
          hwperf_profevctrarg_t.  Returns a positive generation number if
          successful.

     HWPERF_RELSYSCNTRS
          Stop using the counters in system mode and make the counters
          available again.  ROOT PERMISSION IS REQUIRED.  Returns 0 upon
          success.
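     As an illustration, a root-owned monitoring tool that has already
     acquired the counters in system mode with HWPERF_ENSYSCNTRS might
     periodically read the aggregated system-wide counts as sketched below.
     This is a minimal sketch only; the header names are the usual locations
     and are given as assumptions.

          #include <stdio.h>
          #include <sys/types.h>
          #include <sys/syssgi.h>
          #include <sys/hwperftypes.h>
          #include <sys/hwperfmacros.h>

          /* Read the system-wide virtual counters (the counters must already
             be in use in system mode).  Returns the generation number. */
          int
          read_system_counters(hwperf_cntr_t *cnts)
          {
                  int     gen;

                  /* Third argument is the address of an hwperf_cntr_t,
                     as described above for HWPERF_GET_SYSCNTRS. */
                  gen = (int)syssgi(SGI_EVENTCTR, HWPERF_GET_SYSCNTRS,
                                    (void *)cnts);
                  if (gen < 0)
                          perror("HWPERF_GET_SYSCNTRS");
                  return gen;
          }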
NOTES
     The following table, ordered by event number, lists the revision 3
     R10000 CPU counter events that return information different from the
     R12000 CPU counter events.  If an event is not listed here, it is the
     same on both CPU types.

     Event   R10000                           R12000
       1     Issued instructions              Decoded instructions
       2     Issued loads                     Decoded loads
       3     Issued stores                    Decoded stores
       4     Issued store conditionals        Decoded store conditionals
      16     Cycles                           Executed prefetch instructions
      17     Graduated instructions           Prefetch primary data cache
                                              misses
      30     Store/fetch exclusive to clean   Miss Handling Table entries
             block in secondary cache         accessing memory

FILES
     /usr/include/sys/hwperftypes.h
     /usr/include/sys/hwperfmacros.h

SEE ALSO
     ecadmin(1M), ecstats(1M), perfex(1M), libperfex(3C), and libperfex(3F).